Oops. Mm-hmm. The rapple also. No. Okay. Mm-hmm. With the link here in. Mm-hmm. Mm-hmm. Mm-hmm. Of the value, yeah, mm-hmm. Mm of the previous r uh page rank or of the Okay. Mm-hmm. Mm-hmm. Mm-hmm. Uh-huh. Mm. Okay. Mm uh-huh. So wait uh, I don't get it. You you have two representation, one which comes from this link uh stuff, and one f this one is not based on P_L_S_A_ too? Yeah. Okay. Y A T_, yeah, uh okay. Yeah. Yeah, okay, okay. Okay. Mm-hmm. Ah uh okay. Yeah yeah, it's okay. Mm mm mm. Yeah. Or they use the perplexity on the the likelihood to say oh good, this likelihood is uh is it's smaller than the other. Mm. Mm-hmm. Mm-hmm. Uh it's the same. I it's making the assumption that uh this human categorisation is related to the one that is discovered uh mm. Mm. Oh, yeah. No, but But I have the feeling i in But in this case you you have to use uh the label one, yeah. Your los But how does it enter in that? I i you have to train your uh Your training data. Yeah, but the point is that you should change in a way that to take the label into account, or not? Okay. Okay, okay. Mm-hmm. Mm-mm. No, but w But i No. The the there is not uh v uh perfect uh clustering clus wh what is the how do you judge uh what is a good clustering? And you're not exactly compar uh. You're comparing them Ah, okay okay. Yeah yeah. Yeah. Also like a baseline, yeah. You m you still not combi No no, but it's another Mm. Mm-hmm. Mm. Mm-mm. In fact I I have seen a paper in I_C_M_L_ where the guy was trying to learn which word he um aspects of clustering, so I give you I give the real users mm the person um a bunch of marbles and they begin to say okay, and they have to divide it in five group, and in the end it's not al it's it's always different to the clustering each people has, because it's completely subjective. 
But in but there is uh some characteristic that are for transparency, colour, which are um features that made the clustering, that can be extracted even if the clustering it's always different. So so that's only to say that uh it's it can be really infinitive clustering. Mm-hmm. Say another another example apart from uh this uh search. Uh I don't get it exactly. What you mean, then don't don't use a Google page because it's uh too uh s root, but uh another page mm. Mm. Mm-hmm. Oh, it's not that People being your algorithm? That's the point? Mm. It's getting Mm-hmm, exactly. We're not trying to to fit yeah. Mm. You had onl two topics? Or no? Yeah. Yeah, yeah, yeah. Mm-hmm. Mm-hmm. It was linki, mm-hmm, mm-hmm. So it was a a good point for you. Mm mm-hmm mm. Like this, yeah, mm mm mm, okay. Mm-hmm. mm. Mm-mm. Mm-mm, probably not, yeah. Mm-mm. Mm-mm. Mm. Yeah mm. And Mm-hmm. Yeah. Um this alpha, i can it be dependent of the of the document? And how many text is in it? Or something uh yeah. Mm. Mm-hmm, so mm. Mm-hmm. Mm-hmm. Oh, of course. Yeah, mm. Mm. Yeah, yeah, w w I think we've agree on that, yeah. Mm-mm. Mm. Mm probably heuristic it will be easier to compute because Mm-hmm. Mm-hmm, yeah, mm. Mm. Mm. Did you manage to to divide the link uh properly or Okay. Mm-hmm. L_D_A_. Mm-hmm. Mm-hmm. Mm-hmm. Mm-hmm. Mm-hmm. Mm pyramid, yeah. Ah, okay. Yeah. Mm-mm. 'Kay. Mm-mm. In two months, yeah.
You never know whether you've got these things on right or not. You never know whether they're on right or not. Feels wrong. Okay, so I don't need that. Okay. So David came in to me and to say uh i something we'd been talking about for a while was some stuff I'd been working on about using these topic models with uh with networks, with n things with links between them, documents that are linked. Uh n uh it's actually it's related to rapple, but it's not actually that th this is something I haven't actually been working on for a couple of months now, uh just because I came up with something more interesting that I've been working on, but uh th at the time basically I'd been trying to use P_L_S_A_ and those types of models on web pages, and that came up with uh a big problem in that most web pages have very very little text on them. But you can tell what they're about based on the other texts on the site, the texts on the things they're linked to, so what I wanted to know was can you combine the topic models that the pages, that have lots of text on them, with something else that that tells you what a page that has no or very little text on it based on the links, yeah. So quite simply if you have a page one which is like a home page, and that'll maybe have five links and a picture on it, and then because it links to page two and page three. Can you tell what this is about? Yeah, so I started looking at uh Google page rank, which kinda does this by saying well, the links go the other way, it says if this page is about fish or in Google, they say if this page is good in some way, and it links to this other page, and we don't know what this is, then that page is more likely to be good. And it's just page rank is just a simple algorithm for doing this over the whole big network. So I wanted to combine the topic models that tell you beforehand what you think this page and this page are about with this to maybe get a better idea. 
So the idea is say you've got some text on a page, that's only part of the picture about what of what the page is really about, and the links pointing to that page are also part of the picture. So I basically came up with an algorithm that's just a i it plugs P_L_S_A_ uh plugs page rank into P_L_S_A_, right, so page rank you basically just d uh say You basically take the how good this page is, which is X_, and how good this page is, which is Y_, and you then say this page is based on uh how do you do this? Based on the the links coming into it, uh it's just a like a sum over all the the links of uh well, X_ link I guess. O of of the value. Uh uh the but it's not it's Yeah. If if so if this is page rank of X_, page rank of Y_. I think this is roughly it off hand. Page rank of Z_. So you say page rank of Z_ is equal to uh one over a constant to normalise it. Uh sum over all the links into Z_ of page rank of the link. Yeah, tha it's like that, yeah? So similar idea. If you got a vector that says what the topics are, say you've learned from P_L_S_A_ a d a distribution of topics for every page, so if you say a probability of topic one for uh for page uh Z_ uh is equal to, and you can use the same sort of thing. Right, so nicking a whole bunch of maths from how you work out page rank, uh it comes up with uh just uh an iterative algorithm, which is just uh the current probability of topic one given Z_, so that's probability old plus uh i yeah, plus that and then you normalise between them, so you say alpha and one minus alpha. Yeah. So w basically what you're saying here, one minus alpha say link contribution you can call this I guess. So what you're saying is that a page is partly from its text that you've got on the page, and it's partly from what the links say it's about. So that alpha thing is how much do you trust the text on the page, and how much do you trust the links. So that alpha could be different for every page. 
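The update described here can be sketched in a few lines of Python. This is only a rough illustration of what was said, not the speaker's actual code: the function name, the dictionary layout, and the choice to average (rather than just sum) the in-link distributions before mixing are all assumptions. The text term is taken to be the fixed starting distribution from P_L_S_A_, as clarified a little later in the discussion, and alpha can be one number or a per-page value.

```python
def propagate_topics(text_topics, in_links, alpha=0.5, iterations=20):
    """Blend each page's text-based topics with the topics of pages linking to it.

    text_topics: dict page -> list of topic probabilities (e.g. from PLSA)
    in_links:    dict page -> list of pages that link to it
    alpha:       trust in the page's own text; a float for all pages,
                 or a dict page -> float (hypothetical per-page setting)
    """
    current = {page: list(dist) for page, dist in text_topics.items()}
    for _ in range(iterations):
        updated = {}
        for page, text_dist in text_topics.items():
            sources = [s for s in in_links.get(page, []) if s in current]
            if not sources:
                # No incoming links: keep the text-based estimate.
                updated[page] = list(text_dist)
                continue
            a = alpha[page] if isinstance(alpha, dict) else alpha
            n_topics = len(text_dist)
            # Link contribution: average topic distribution of the in-linking pages.
            link_dist = [sum(current[s][t] for s in sources) / len(sources)
                         for t in range(n_topics)]
            mixed = [a * text_dist[t] + (1 - a) * link_dist[t]
                     for t in range(n_topics)]
            total = sum(mixed)
            updated[page] = [m / total for m in mixed]
        current = updated
    return current


# Toy example: a text-poor home page with two in-links from pages about topic 0.
text_topics = {"home": [0.5, 0.5], "fish_a": [0.9, 0.1], "fish_b": [0.8, 0.2]}
in_links = {"home": ["fish_a", "fish_b"]}
result = propagate_topics(text_topics, in_links, alpha=0.5)
print(result["home"])
```

With alpha close to one the page keeps what its own text says; with alpha close to zero it takes on whatever the linking pages are about.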
So what I did then was try this on a big database of web pages. Uh and I had some results from that, but they were all on temp two, so that's not important right now. Uh but as it is not really important, I can regenerate that. The problem was that this gives you a new vector of topics for every page. So you've got two vectors, you've got the one that you've just got from P_L_S_A_ or whatever model, and you've got the one that you get from this. Uh how do you tell which one's better? I had no way to evaluate this well. How do you tell that this page because you don't know where the topics are, 'cause they're learned automatically, uh so It it is, yeah, because uh b yeah, you initialise it with uh actually thi this is kinda wrong, this is not old, but start this is how you this is how you you start it off. So this this is what P_L_S_A_ gives you to start with, and then you iterate this and it becomes something else. Yeah. Mm-hmm. That's it, exactly, yeah. Exactly, yeah, so you've now got you've got P_L_S_A_ gives you two probabilities, right? So it gives you a probability of a topic given a document, which is Z_ given D_ and a probability of a word given a topic. So this is the same for both models. Yeah? Well yeah, what you do is you're taking this and iterating over it, the probability of the topic given the document. Uh so have I made that clear, I'm I'm not sure if I'm explaining this well. Okay. 
Right, so y i you m Yeah, so we were struggling at how do you evaluate this, because uh I mean if you look at the way that P_L_S_A_ and all those kind of models have been evaluated, it's th they never quantitative quantitatively do anything, they always say uh well, look at the clustering, there's here we've got some documents that all appear to be about the same thing, and here's where we get some other documents that all appear to be about the same thing, and you can't really tell exactly how good that is, you can just get an idea that it's Yeah, so tha that's the first thing th that thing when I spoke to you about this, you suggested d use the likelihood and Pedro came up with a really good example why that could be bad. So you have uh you have the uh oh, let me think of what this was. It was to do with cars basically, if you have you have a topic that's uh representing cars, and you can tell that that's representing cars by looking at this distribution, and it's the same for both. Now you wanna tell which i i this page, uh Pedro's example of what it was, is that it says on the page Lotus Elise. Okay? And that uh no f I can't I just remember it was a really can you ig remember what it was? No, okay. Yeah, I didn't really consider doing it on perplexity, actually, that to be hon but okay. Okay. Is that fair though, because I think one of the things with these topic models is that you don't know you can't make them cluster Okay. Hmm. Yeah. Okay. And that is is that really fair though? Can you say that if a set of documents if the clustering of these documents is closer to the effecti what you what you've got there is uh a clustering of documents a as your evaluation, yeah? Uh thi this these labels, the Yahoo categories, that's that's a clustering. So you're by saying that this clustering is closer to that clustering is is does that necessarily mean that it's better? Mm-hmm. Okay. Mm-hmm. Mm-hmm. Mm-kay. Mm.
We uh I think your point was that you you train this without labelling first, yeah? Mm. Mm-hmm. Well th yeah, the the thing that worries me about this approach is that say the really really really simple example of this where your directory had two categories in it, uh good and bad, and and and you run P_L_S_A_ on it, now that doesn't do it good and bad, it does, I dunno, English and French pages, so they don't match, but the clustering is still very good in in one respect. Uh but but you you can extend that up, even if you have a hundred categories, P_L_S_A_ can find a hundred clusters that are a good clustering, but are completely unrelated to the directory. Hmm. But but what I wanna know i is that which of them has the better representation of a given page. So so given a random page, uh and these two models, uh and the wo the words that these two models give, which is more likely to Mm-hmm. Mm. Mm-hmm. Mm-hmm. Okay. Yeah. Okay. S okay. I'm I'm just n concerned that that's uh that that's it's answering a question that evaluates in some way, but maybe not in the way that Well that maybe not is the same way I was thinking, but maybe it's a better way um so what I mean by that is that sorry, I'm trying to process all this. Uh Yeah, i it's the same thing, like it it does tell you whether this uh which clustering is better for this task, but it doesn't tell you which clustering better represents the pages necessarily. But it it's it you you could say that it possibly doe it probably does. Okay. Mm. Yeah i Mm. Okay. Mm. I think mm. Uh no, I th I'm I'm coming round to it actually. No, I'm so I I I think I g uh I I get it. Uh I I th I think it probably is fair, I'm just I I've gotten um uncomfortable with it, but I think that'll go if I think about it enough. Uh but but yeah, the it it but it just seems a little suspect, but I think that's just from the whole point that you can't tell which clustering is b better. 
But uh what I'd thought before was could you get people to evaluate this by by hand basically, could you give people a set of pages and like say you did it with five topics for example. Uh now you give people five sheets uh you give uh a t a a user five sheets of paper, each with words in varying sizes representing the probability of that topic for example. Try give the user these pieces of paper to get an idea of what these topics are, that's this P_ of W_ given Z_. And then you give them some pages, and say which one would you which piece of paper would you put that with, and then you get like an intuitive ide uh like how would a person cluster pages given the what these topics supposedly are. 'Cause i you've got something that's fixed, the the P_ of W_ given Z_ with respect to both models. So I was tr I was hoping you could use that to Mm. Mm-hmm. Mm-hmm. Yeah, okay. Hmm. Okay. See, the the thing about web pages I thought as well in that getting humans to evaluate it in some way is that you have most of the web pages that are uh well a lot of the web pages that are important in some way are are are home pages, so Google dot com doesn't actually say the words search on it, I think. I maybe it says go, but you only know it's about search through the thing, but someone looking at it can tell that it's a search engine, because of the way it's laid out, because of the pictures on it and because of all of that. There's a lot of information that's completely lost in this bag of words representation, and I was hoping that in some way you could use that to evaluate this. I have no idea how, but uh yeah. Mm. Yeah. Mm. Yeah. Okay, I think I know it. But it yeah, thi this is effectively what we have here though, and th that directory is one clustering of the marbles. And you're hoping that it it so what you're saying is that I have two clusterings of marbles for my two models, and I have a test clustering of marbles. 
It's saying that one of these is more close to the test t t to the the one that we've been given, is that it's saying that what does that actually say? That doesn't uh it says it's g better for this task, but it doesn't really say anything else. Mm-hmm. Okay. Okay. Okay. Uh and so and so if I get that right, what what the algorithms are still coming up with is in this way you're looking at a clustering of marbles, yeah? If each of the algorithms comes up with a defin Ah okay, right, that makes sense. Okay. Okay. Okay. I'm convinced, I'm convinced, alright. So it's good. Right. Uh alright, well thanks. Um I'm trying to think of anything I haven't said about the stuff, but that uh from looking at the results I got it looked quite good, um just like hand reading them in that one of the things I did was I started a i a web crawl at the IDIAP homepage. It did not very many pages, only like two thousand or something like that, but enough to get the pages that were close to IDIAP to have a lot of links between them. So that there was enough to be able to run this link clustering algorithm. So what you find is that the pages around IDIAP, there's a lot of English pa speaking pages and a lot of French speaking pages, so of course this uh P_L_S_A_ completely separated those out into into categories. Uh so when I looked at the the word likelihoods, there were two topics. One was very clearly English words that most likely word was the, and one was very clearly French words most likely words was de I think and uh No, I had many topics. The others looked like garbage, couldn't tell what they might have been. But anyway, uh tha that's often the case, when you look at it you can tell very clearly that one or two topics are a b not the rest, yeah. It is. Ok Hmm. 
That w uh anyway, that when I run it on that, uh what I found was, looking through some of the sites uh I looked for the sites that the models differed on, that where one said it was probably an English phrase and the other said it was probably a french phrase, and I found a good few examples where you had a site that looked uh well one that I remember off hand was it was a page and it was a big picture, and links down the site, but the links were all pictures. So you couldn't the the words were in French and the links, but there were no words picked up by the when I did the crawl for the model. And there was a small bit down here, and in this there was uh meant to be picture in there, but it didn't load uh or uh b it was meant to be something else, and it said uh in English it said four O_ four this page cannot be found. So all the words on this page were English, but it was very very clearly a French page. So the first model said it was an English page 'cause all the words were in English, second model said very definitely a French page because all of these pointed to French sites and it was only pointed to it was something dot F_R_ as well, it's Yeah. So I I found a couple of examples that were kinda like that. Mm but the the problem that I found when I ran it, uh the th the thing that was a bad point was that if you have uh say tryin I was trying to say if you could evaluate this with search given some keywords like which model would more likely find you a p good document, that that sort of approach. Uh what I found was that say you had uh Uh th tha that's not too relevant necessarily, but if you have say a topic distribution for uh one page, and it very very highly scores on the P_L_S_A_ for one of the the topics, and that's right, okay, this model, because it's averaging out over all the pages nearby, all of the topic distributions become more flat. It it becomes more uncertain about everything. So I don't know if that's necessarily a good thing or not, yeah. 
In in some cases it's gonna be a bad thing, b because ev every page is linked by the way the algorithm works, every page is linked eventually to everything else, so it every page has information incorporated in it. Or i the distribution learned for every page has information incorporated in it, from every other page's distribution as well, so it averages out, it flattens all the distributions, depending on what this alpha is. Mm-hmm. Well I I think that the thing is Yeah, that that there's a balance in the uh for example take a page like uh Slashdot, and it's like a news site uh for geeks, and you have uh nerds, yeah. News for nerds, stuff that matters. Uh all of the like hundreds of thousands of pages point to that, but they're about completely different things, so you in effect are losing information about what that page is about, because there's so many things pointing to it and they all average out as Yeah, so that Th that's my first idea w was how I set this I remember for the Lotus thing wasn't there? Um so I thought you wanna probably set this alpha per page, there's probably some heuristics, for example this page, there's no text in it, will set it very low in user information for the links. Problem is, this is what Pedro immediately said, what if you have a page that looks like this? It's got a big picture of a car uh Anyway, and it says Lotus Elise, so you know that it's definitely about car, but it's only got two words. So you can't just do it by words, because those two words completely specify into something or other. Uh su you suggested doing likelihood, but at that time and I can't remember what my problem was with that. It might not have been a good thing, but you've got two likelihoods for it as well. No, maybe not. Mm. Yeah, maybe do it by cross-validation, something. Okay, there's Mm. Hmm. Yeah, that that is fair. Yeah, okay. It yeah, d No. Yeah, i so uh I I think that that's too easy. 
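The flattening effect can be made concrete with a small Python sketch. It is an illustration under assumptions, not something from the experiments: the numbers are invented, and entropy is used here as one possible measure of how flat a topic distribution is. It mixes a sharply peaked text-based distribution with a flat average from neighbouring pages for several values of alpha.

```python
import math


def entropy(dist):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def mix(text_dist, link_dist, alpha):
    """alpha weights the page's own text, (1 - alpha) the link contribution."""
    return [alpha * t + (1 - alpha) * l for t, l in zip(text_dist, link_dist)]


# Invented example: PLSA is very sure about topic 0, the linked pages are not.
peaked = [0.97, 0.01, 0.01, 0.01]
neighbour_average = [0.25, 0.25, 0.25, 0.25]

for alpha in (1.0, 0.8, 0.5, 0.2):
    print(alpha, round(entropy(mix(peaked, neighbour_average, alpha)), 3))
```

The lower alpha is, the closer the mixed distribution gets to the neighbours' average, which is the loss of certainty described for heavily linked pages.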
You can do it with some heuristics and you can learn it from the task results in or combine the two, so tha that's kinda straightforward, I think. Yeah, the the the thing about doing it through cross-validation or something like that with the task is uh the problem with doing a page rank style algorithm is that is that it gets better it it converges better and it gives you better results, the closer a network is to being a small world network. So small world network mean basically some far links, but generally very clustered locally. Uh now the thing about doing a web crawl is, the bigger your web crawl, the closer it'll be to the overall structure of the net n the internet, which is a small world network. The smaller your web crawl, the less small world network like it's gonna be. So the more pages you get, the better your results. And people running this style of algorithm, the page rank style stuff, generally run it on about two million or twenty million pages, or whatever. The the Stanford WebBase data set that I was using for this, for for a bit of this um, they have a a d I can't remember off hand, but they have m maybe a bit over a terabyte worth of web pages in this, a bunch of servers. Uh Yeah, maybe Wikipedia is a good thing to do. What I was concerned about Wikipedia though was that it's while it has a high level of inside links that has a lot of directory style links as well that might mess up it actually being a see wh while you have Ah okay, that's good. That plus I guess it's actually straightforward to say to start with is this a small world network and then run it on that. You were running that on what we what was it, like half a million pages, something like that? Is it and that was for half of it? Uh if I remember. Okay. Okay, the whole thing was Mm. Mm-hmm. Okay. I mean I think off hand thinking about this model it would have to work better, because two things in the same category are going to be linked. So it it yeah. Okay.
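One rough way to ask "is this a small world network" before running the link algorithm is to look at two standard quantities: how clustered the graph is locally and how short the paths between pages are. The Python sketch below is an assumption about how that check might be done, not something described in the meeting. It treats links as undirected, and the function names and the toy graphs are made up for illustration.

```python
from collections import deque
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient; adj maps node -> set of neighbours."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute zero
        linked = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over reachable ordered pairs, via BFS."""
    total, pairs = 0, 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy graphs: a fully linked triangle and a star with no links between leaves.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
print(average_clustering(triangle), average_path_length(triangle))
print(average_clustering(star), average_path_length(star))
```

A crawl that looks small world like would show clustering well above that of a random graph of the same size while keeping the average path length short; the threshold for "well above" is a judgment call.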
Yeah, that's I think tha it's gonna be more true in some networks and in some tasks than others, and I think in the Wikipedia it's probably gonna be it's a good one to test on. Mm-hmm. Okay. Okay. Hmm. Mm-hmm. Okay. Yeah, it I'll maybe I'll have a look at the Wikipedia stuff anyway. I'm stuck doing something else for the next month or two anyway, so. Uh this is always the case. Uh the there's another thing, um uh this was new but I haven't thought about this too much, but it's on the same lines in that uh have have you guys seen this uh correlated topic models work? Correlated topic models. It's David Blei, the guy that came up with uh with L_D_A_. He's got uh a new model that's it's published in NIPS this year, but his co-author put it on their web site a month or two ago. So uh you know L_D_A_? D you all know L_D_A_ or yeah. Uh okay. So a L_D_A_ you basically draw a a Dirichlet which says like what type of document this is effectively, and then you draw a topic given that Dirichlet distribution and then you draw words given that. Alright, well that's for for every word and that's once for the document and that's you've got a parameter coming at this globally. So that that's roughly what the model is, plus other little bits. But anyway, so what they did was they they reworked this because one of the problems with L_D_A_, and I think P_L_S_A_ has this problem as well in is that uh the more topics you have, the less uh expressive it can be, because the topics it cluster it it pushes a document if it is pulling a document into one topic, it pushes another topic away from that. Doesn't that make sense? So the topics are necessarily trying to be different from each other. 
So the more topics you have, the more th they're sort of squeezing into the space of what things can be about uh and things are kinda pushed out, so once you m go over you try it one corpus, and once you go over about a hundred, maybe two hundred topics, uh the the new you add more topics and it doesn't get better in any way, the model doesn't. The n new topics are effectively going to just be garbage. Uh, alright, so what they did was they they they stopped using the Dirichlet and uh they're starting to use uh a Gaussian there. So it it in to spare the details basically, they've they've got a variable here, I think they call it eta uh and they from this generate a distribution of topics in some way, but it's generated from a Gaussian space, from a from a a space, so it's given a a mean and covariance uh and Yeah, uh so what it does, the this is the the bit that took me forever to understand actually, uh same model, but this is given a Gaussian space it maps that down into the simplex of what a topic can be about. The so the simplex basically if you've got a say you've got two topics. Now say i i say you get three topics, that makes more easier to draw. Uh for this is theta one theta two theta three, these are the probabilities that that a document is about uh this theta one is the probability that this document is about topic one. Theta two is so these guys all have to add up to one, yeah? Uh so that basically means that what a document is about lies on this space between them, yeah. So this is the simplex. Uh so basically a distribution generated by this guy is basically a point on this simplex. So what they do is they've got a distribution that maps uh a Gaussian and some space into a distribution on this thing, so it maps it into Gaussian in that. The cool thing about that is that if you've got two Gaussians, you can tell how close they are. So you can now tell how close two topic distributions are, right? 
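The mapping from a Gaussian space down to the simplex can be sketched in Python as follows. This is a simplified illustration, not the model from the paper: it assumes the mapping is a softmax of the Gaussian variable eta, it uses independent per-topic variances where the correlated topic model has a full covariance, and the distance between two means is only one possible way to say how close two documents are. All names are hypothetical.

```python
import math
import random


def softmax(eta):
    """Map a real-valued vector to a point on the simplex (sums to one)."""
    shift = max(eta)  # subtract the max for numerical stability
    exps = [math.exp(e - shift) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def sample_topic_proportions(mean, std, rng):
    """Draw eta from independent Gaussians (a simplification), then map it."""
    eta = [rng.gauss(m, s) for m, s in zip(mean, std)]
    return softmax(eta)


def mean_distance(mean_a, mean_b):
    """Euclidean distance between two Gaussian means."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_a, mean_b)))


rng = random.Random(0)
theta = sample_topic_proportions([0.0, 1.0, -1.0], [1.0, 1.0, 1.0], rng)
print(theta, sum(theta))
print(mean_distance([0.0, 1.0, -1.0], [0.5, 1.0, -1.0]))
```

Every draw lands on the simplex, so theta one, theta two and theta three add up to one, while the comparison between documents happens in the Gaussian space.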
So you can have as many as you like, uh and you can tell that they're close to each other, so there's load of ideas that I had about this uh in that if you can tell that two topic distributions are close to each other. So for example if you split a document into two parts and said that the two parts have to be close in the distribution, you can see how the doc t the document changes, for exa stu stuff based on that. With this idea, what if rather than working with this P_ of Z_ given D_ thing that you had from P_L_S_A_, but what if you worked from these Gaussians that you get for every document and you said that in some some function that said uh that the Gaussian representing this should be similar to the ones linking to it. Now I haven't thought about any of the maths for this, but but it should be uh straightforward I guess to to formulate a basic in terms of the mean should be similar and the covariance maybe should be similar as well. So it might actually be a better model again. I dunno. I'll send you guys the link for that correlated topic model stuff if you want, but uh I don't know if it's too interesting to you. Okay. Alright, well than thanks for hearing me out and the idea especially. Maybe I can do more with this. Alright. Yeah,. Yeah, b Bastien, you can't release this data for four months.
Yeah. Okay. Okay. Okay. Yeah. It is the initialisation of. Yeah. But then it is iterative. Okay. So you you should first train the P_L_S_A_ stuff, then you initialise this link uh spreading thing, and then you iterate over okay. And it converge and so you have the boost uh initial P_L_S_A_ and P_L_S_A_ plus the link thing then. Yeah. Mm. Yeah. So th yeah, you you you were saying that you would like to evaluate that now. Yeah, right. Hmm. But but maybe you you could have some some other labelli li or the task like if if you imagine you have some labelling on of of the web pages like like like for example you you look at at at at at a set of pages, and you have the m the directory entries of these pages, so so f like Yahoo or entries, so you have categories and and entries, so for example you have I dunno, cars, S_U_V_, etcetera. So what you would like would be that two pages on in the same um category should be closer than than two pages being spread across different categories. But you you would not use this you will not use the Yahoo directory or n the directory in the training step of i of the model, okay. This is something you have just for evaluation, and then for example you you your task would be to say I have some page and and I know where it is in the in the in the directory, and I have another page, I don't know where it is, and I would like to to know uh whether it is in the same category as the first one, just by comparing their uh aspects. If they have the same uh yeah. So basically what you would like to check if is is is there some similarity in the in the in the aspect distribution, tells me uh something about uh the sim the the fact that they could be on the same category in the the this end end label categories. Yeah. Yeah. Yeah, yeah. Yeah. No no no no no. How would So so so w what what you would have have basically would be to have a directory which give you labels for some part of the documents. 
Okay, so this is a directory and this is some some some some web page, okay. So the labelled one, okay. And you have some page which are unlabelled but you have links between those two. So what what you would do is that you would try and P_L_S_A_ and and uh and use a model over that. Then then you would like to for example what f what could be very simple would be just to then to match each pages like that. So let's say you have you have a a page which is labelled, and you say okay, the the label page two should get should be the one w which it is the most similar to, or any kind of classification rules you can imagine. And so and so you say okay, what what's if I from from this page which have the similar so so the the criterion would be to compare those two distribution, okay, and then and then you would like to assign to this document too where you don't know the labels the same one as this one. And and here, that you use only for test, so this is your. Then you can evaluate how wrong you are or cu how close you are from the real assignment, the right directory category. In this way I think it's fair if you kn if you know only if you know only that and that and what you would like to discover is this. This is a r like a real task. Like like if if if this directory uh uh company would like to extend this directory to new pages without the hassle of having human labelling. Yeah, you have t you use this one. You you only have the labels for some uh small part of the So So you you would train it you this is this is t t all all those links are known. The links between some labelled document and unlabelled document. I d I think you can do it as a two step process. First you you train uh well maybe s it wou could be better if if if you take it into account. 
But but maybe you can you can do just a two step process, first one you have this uh unsupervised task over both P_L_ and P_U_, and then you from these unsupervised starts you can know which document of P_L_ is close from the document of P_U_ or the reverse. And you can infer some label to a document of P_L_. Yeah. Yeah yeah. Yeah, but Yeah. Yeah. But but here you can you you say given a random page of P_U_, which of those two model is more likely to to to to to say for example that thi the page of P_L_ which is the most similar, so you so so from this so you take any distribution comparison things, so so Hofmann in his paper was doing just cosine similarity, but you can do anything. So so you just compare uh those two distributions so of D_ one and D_ two, okay? And so with that you can you can determine for example for a document o of the set two to the most similar So you you you you select D_ one so in this set which is the most similar to D_ two according to the distribution from one or the other model and then and then for this document you automatically assign to it the label of the document uh one and you check with this which is your uh your your if you are right or not. This is what what this is very related to what Frank did with the with the keyword uh image annotation. So you would have a set of keywords, the set of image label with keywords, and he has a set of unlabelled image with some, and it would check whether the the label he assigns, so in this case annotation word h are correct. And I think it's fair uh it's fair uh setup. So maybe this classification rule is not the best. There there's some might be some work to do, but the general idea, I think it's it's fair evaluations, it's a task. That can be dep Yeah. Yeah. No, but according to this directory yeah. yeah. I I I think it's w Yeah. Yeah. Yeah. What's good with the directory is that you have a good coverage of various topic humans humans can label.
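The evaluation task proposed here can be written down as a short Python sketch. It assumes the simplest classification rule mentioned, giving each unlabelled page the label of the most similar labelled page under cosine similarity of the topic distributions. The function names, the data layout and the toy pages are invented for illustration; the directory labels of the unlabelled pages are held out and only used for scoring.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def transfer_labels(labelled, unlabelled):
    """labelled: page -> (topic distribution, directory label);
    unlabelled: page -> topic distribution.
    Each unlabelled page gets the label of its nearest labelled page."""
    predicted = {}
    for page, dist in unlabelled.items():
        best = max(labelled, key=lambda other: cosine(dist, labelled[other][0]))
        predicted[page] = labelled[best][1]
    return predicted


def accuracy(predicted, held_out):
    """Fraction of pages whose transferred label matches the held-out one."""
    hits = sum(1 for page, label in predicted.items() if held_out[page] == label)
    return hits / len(predicted)


# Invented toy data: two labelled pages, two pages with held-out labels.
labelled = {"l1": ([0.9, 0.1], "cars"), "l2": ([0.1, 0.9], "fish")}
unlabelled = {"u1": [0.8, 0.2], "u2": [0.2, 0.8]}
held_out = {"u1": "cars", "u2": "fish"}

predicted = transfer_labels(labelled, unlabelled)
print(predicted, accuracy(predicted, held_out))
```

Running the same procedure once with the plain P_L_S_A_ distributions and once with the link-adjusted ones gives two accuracy figures on the same task, which is the comparison being argued for.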
But then if you want to detect different languages, this is another task. I think the task of this latent model is to discover some semantic, like some topic proximity between documents, and here this directory is doing the same thing, like associating some topics with some documents manually. So comparing them is a good setup with respect to the goal. I compare those two, like the P_L_S_A_ and the other one, over this same task of assigning the good directory. I don't know if it will provide good results or not, but it is clearly a benchmark for those two goals. Yeah. Like bag of yeah. Yeah, you can have three setups. Basic cosine over the P_ of T_ given D_, so over this empirical distribution. And the two aspect models. We have to repeat over and over. This is like advertisement. No no, it's a joke. Yeah. But the point is that the Z_, the aspects, come unsupervised. It's more like if you said to people okay, divide me this bunch of documents into five topics, and you don't give them the topics a priori, you know. So two persons could come up with very different one is very interesting in Yeah. Right. But even someone would provide very different labels, like someone would say it's a page in English, someone would say it's a web page which has very little information in it, or anything. So I think there's no hard clustering, and what's good with those models is that they are intended to provide you some kind of way to compare two documents rather than to give you predefined sets of documents. This is not hard clustering. I don't think one aspect or one latent topic has any sense on its own.
It's when you have the whole set and the word distribution that you can tell something. If you just take one of them alone So you take all the P_ of Z_ given D_ and all of the P_ of W_ given Z_, you can do anything with that. It's the whole set. I would say this should be evaluated, meaning that, for example, with the example about marbles, the point is that what you can learn from all these sets made by people is that those two marbles would be considered as more likely to be in the same set, because they share some common properties, which are not seen as the same by everyone. But they share some common properties that people will notice. And if you look at all possible pairs of marbles, you would have much more agreement between people than if you look at the predefined sets, where people would just cut at some precise point. Yeah. Well, no no no, you don't really have two clusterings of marbles. What you have is one clustering of marbles, which is this one, and then you have lots of people telling you I consider those marbles to have very salient features that they share in common, I would like them to be in the same cluster. And those two, so one of them is labelled, and the other is not labelled. So you look at that, you say oh, someone told me this one was very similar, and this one is said to be blue for example, I dunno if it's the label. So I put it in the blue bag. But you never use the blue bag, you don't ask the people to look at the blue bag. You just have the people compare those two marbles. And then it happens that some of them have been clustered. Yeah. Yeah, this is I reverse both roles.
This is the players. No no no no. It comes up not with the clustering but with some way to compare two marbles, okay. Some would say I consider those two ones completely dissimilar, and the others say they are very similar. And if you find that in your unlabelled clustering those two marbles happen to be in the same cluster, you say the algorithm might be wrong. But you average that over lots of documents. Yeah. They make sense only so global optimisation, the likelihood is optimised globally over the whole set of aspects, etcetera. So you never ask the model to do hard clustering of documents, you just ask it to find a distribution over aspects such that a document is more likely, under the restriction that the number of aspects is limited. And so it just makes sense looking at the whole set. So I think all evaluation should take that into account and look at the whole aspect set. Like here you look at the aspect set because you look at the similarity of two distributions. Any kind of thing where you unfold the model and look at it manually looks weird, like in those Hofmann papers where they show you columns with words or things. Okay. Yeah. But also that could be a good effect. Because if you are highly reliable about one page, but all the pages linking to it are actually about very different topics, you could say maybe I was wrong in deciding only from the text that this page was about this aspect distribution, because all the other pages are voting in a different direction. While there's a balance Hmm. Yeah. Pedro always speaks about cars. But I think for any kind of way you select those alphas, you have to look at the effect on the evaluation task.
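The pairwise comparison from the marble example could be scored along these lines. This is a hedged sketch, not something anyone in the meeting ran: `pairwise_agreement`, the 0.5 similarity threshold, and the toy marbles and judgments are assumptions invented for illustration. It checks, pair by pair, whether the model's similarity between two aspect distributions agrees with a human "similar or not" judgment, and averages over all judged pairs.

```python
import math

def cosine(p, q):
    """Cosine similarity between two aspect distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def pairwise_agreement(dists, judgments, threshold=0.5):
    """Fraction of judged pairs where model and human agree.

    dists:     dict mapping item name -> aspect distribution
    judgments: list of (item_a, item_b, human_says_similar)
    threshold: assumed cut-off above which the model calls a pair similar
    """
    hits = 0
    for a, b, human_similar in judgments:
        model_similar = cosine(dists[a], dists[b]) >= threshold
        hits += int(model_similar == human_similar)
    return hits / len(judgments)

# Toy data (hypothetical): three marbles, two latent aspects.
marbles = {"m1": [0.9, 0.1], "m2": [0.8, 0.2], "m3": [0.1, 0.9]}
judgments = [
    ("m1", "m2", True),   # a person found these two similar
    ("m1", "m3", False),  # a person found these two dissimilar
    ("m2", "m3", True),   # the model will disagree on this pair
]
agreement = pairwise_agreement(marbles, judgments)
```

No hard clustering is involved: only the similarity between whole aspect distributions is used, which matches the point that a single aspect has no meaning on its own.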
And then maybe you can find some patterns. Once you have the task, you could look at which pages are completely missed by one model, which others are successful with one model, and you can maybe infer some rules afterwards. But if you have no task, you can't blindly select good alphas. Like if you look at your task and you see that, I dunno, the long pages for example have a good P_L_S_A_ model without fusing links, then you can infer some heuristics like okay, the alpha should be proportional to the length or anything like that, or to the number of images in the page, or anything. But you should have some task where you can individually look at the performance on every page. And as a second step you could infer some kind of rule about how to select alphas with respect to the task. Maybe it's very simple rules, but selecting them blindly might be odd. Yeah. But still, like the work I did on hyperlinks, I used Wikipedia, I think it's good because it has a fairly high level of inside links between Yeah. Well, you can categorise them because there's a structure on the web page. So you can select only the links which are within the article or within the categories. It's well-structured, it's easy to manipulate and Yeah. You can even isolate the directory from the categories of Wikipedia. Well yeah, actually I divided it in three equal-size parts which have one hundred fifty documents. There are a thousand documents. And well, what I did with the links well, but this is not exactly the same setup you would have. So in this setup, the whole link structure could be known, and the only thing you don't know would be the assignment to a category.
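The per-page analysis suggested for choosing alphas might look something like this. It is a sketch under assumptions only: the function `accuracy_by_length`, the 500-word boundary between short and long pages, and the toy results are all hypothetical. The idea is to record, for every test page, whether the text-only model and the model with links got the category right, and then compare the two models inside each length bucket before inferring any rule.

```python
from collections import defaultdict

def accuracy_by_length(results, boundary=500):
    """Compare two models separately on short and long pages.

    results:  list of (n_words, text_only_correct, with_links_correct)
    boundary: assumed word count separating short from long pages
    Returns {bucket: (text_only_accuracy, with_links_accuracy)}.
    """
    buckets = defaultdict(lambda: [0, 0, 0])  # pages, text-only hits, with-links hits
    for n_words, text_ok, links_ok in results:
        key = "long" if n_words >= boundary else "short"
        counts = buckets[key]
        counts[0] += 1
        counts[1] += int(text_ok)
        counts[2] += int(links_ok)
    return {
        key: (text_hits / n, link_hits / n)
        for key, (n, text_hits, link_hits) in buckets.items()
    }

# Toy per-page outcomes (hypothetical, not real results).
toy_results = [
    (1200, True, True),
    (900, True, False),
    (80, False, True),
    (150, False, True),
    (60, True, True),
]
summary = accuracy_by_length(toy_results)
```

If a pattern like "long pages do fine on text alone" showed up on real data, that would be the kind of evidence from which a length-dependent alpha rule could be inferred afterwards, rather than picked blindly.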
In my setup it was different, because I couldn't use the links between different sets. But I don't want to mess you up with that, it's not the same kind of experiment. But you could just use the whole corpus like this, and for some of them you assume you don't know the category assignment. And then you can run this set of experiments over Wikipedia. Yeah. But this is the hypothesis in this P_L_S_A_ with links stuff. Oh yeah, I thi Yeah. But you can also think of other tasks like that. Like for example you can take proceedings and assume the categories are the keywords of the documents, and then if you have the archive of a journal, you can use the citation links. So if you want to have bigger stuff so Wikipedia would be kind of very small with lots of in-links, and then if you take a journal you will have fewer links and more documents. And then you would be likely to say that P_L_S_A_ might be better over that, because you don't have enough links. So you can find examples which are not the web. I think it's good to work in a closed set. Wikipedia is kind of a closed set, even if there's lots of outside links, there's lots of inside links. You have to think of databases where there's a lot of in-links. Wikipedia might be one, citations, for example if you take the NIPS archive, which is available. NIPS authors always cite other NIPS authors, which are often themselves, etcetera. Well, it's reported, I won't say. So but yeah, I think with this kind of setup, maybe web pages, it's hard to find sets of web pages which have good link stuff if you start from one page or something. Okay. But it's discrete no, the Okay. Yeah. 'Kay, yeah, thanks. Okay, yeah. Well, there are lots of good ideas that are recorded and someone will Yeah. Because the data will be released very soon.
So is it correct or no? Hmm? Yeah. Well, so we hope. Mm-hmm. Yeah. Oh, this is frozen, okay. Mm-hmm. Perplexity. Yeah. Oh, no, I don't even remember you said I would tell you that evaluation should be done based on likelihood or perplexity. Well, it's basically the same, but I dunno. Yeah, another task or Yeah. No no no, it doesn't have to be that. If you use the latent representations or distributions over aspects, you could use this to say whether two documents are in the same directory or not. And this you should be able to evaluate then. Clearly we had done this. You can train after that an S_V_M_, saying this is category one or two on the labelled documents, and then you will have an unlabelled document and you just say this is directory one or two. And it should improve it: if you say this feature extraction process is better, then you can have a cleaner separation between those two. Yeah, well, but I mean if you have to judge, it's with respect to a given task. I mean you can't say of course it's always good, because it's maximising the likelihood of whatever. Yeah. Mm. Yeah, it's the same. Mm-hmm. Fus. Mm-hmm. Yeah. 'Cause I don't think there is any way to say generally this is better or worse. Yeah. It has to be really with respect to this, this clustering makes sense or I dunno. Yeah. You can even keep the bag of words and do the cosine comparison or Just showing first how much better this is with respect to bag of words, and hopefully then how your new regularised P_ of Z_ given D_ is. I dunno, it's basic. Mm-hmm. But Yeah. Yeah, just know where. Mm-hmm. Yeah, yeah. No, it does Yeah. It doesn't have to match. Yeah. But yeah, it just re Mm-hmm. Yeah. Because the latent aspects really don't have to match the final directory category.
It can be anything, but it's just a way of saying yeah, those documents are similar in some sense. So it's just sometimes dangerous to say these latent aspects mean this. Yeah, I think yeah. Good, good good. Mm-hmm. Yeah, and the others usually don't mean anything, yeah. Mm-hmm. Okay. For your model, mm, good. Nerd. Nerd. Mm. Yeah, cars and web. Mm-hmm. Which ones? No, no. Hmm. Okay. Well, in three months. Oh yeah, it's too late, three months. We should wait.
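On the remark that likelihood and perplexity are basically the same: perplexity is just the exponential of the negative average log-likelihood per token, so for a fixed test set the two always rank models the same way. The sketch below illustrates this with made-up numbers; the function name `perplexity` and the toy token count and probabilities are assumptions, and natural logarithms are assumed throughout.

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Per-token perplexity from a total log-likelihood (natural log)."""
    return math.exp(-log_likelihood / n_tokens)

# Toy comparison (hypothetical): two models scored on the same 10 tokens.
n_tokens = 10
ll_uniform = n_tokens * math.log(0.25)  # every token gets probability 1/4
ll_better = n_tokens * math.log(0.5)    # every token gets probability 1/2

ppl_uniform = perplexity(ll_uniform, n_tokens)  # close to 4
ppl_better = perplexity(ll_better, n_tokens)    # close to 2
```

The model with the higher likelihood has the lower perplexity, which is why neither number alone says whether the latent representation helps on a task such as directory assignment.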
